Big Data
Contextual Online Pricing with (Biased) Offline Data
Yixuan Zhang, Department of Industrial & Systems Engineering, University of Wisconsin-Madison, yzhang2554@wisc.edu, "3026 Ruihao Zhu, SC Johnson College of Business, Cornell University, ruihao.zhu@cornell.edu, "3026 Qiaomin Xie, Department of Industrial & Systems Engineering, University of Wisconsin-Madison, qiaomin.xie@wisc.edu
We study contextual online pricing with biased offline data. For the scalar price elasticity case, we identify the instance-dependent quantity δ2 that measures how far the offline data lies from the (unknown) online optimum. We show that the time length T, bias bound V, size N and dispersion λmin(ˆΣ) of the offline data, and δ2 jointly determine the statistical complexity.
Tight Asymptotics of Extreme Order Statistics
A classic statistical problem is to study the asymptotic behavior of the order statistics of a large number of independent samples taken from a distribution with finite expectation. This behavior has implications for several core problems in machine learning and economics -- including robust learning under adversarial noise, best-arm identification in bandit algorithms, revenue estimation in secondprice auctions, and the analysis of tail-sensitive statistics used in out-of-distribution detection. The research question we tackle in this paper is: How large can the expectation of the ℓ-th maximum of the n samples be? For ℓ = 1, i.e., the maximum, this expectation is known to grow as o(n), which can be shown to be tight. We show that there is a sharp contrast when considering any fixed ℓ > 1. Surprisingly, in
Precise Asymptotics and Refined Regret of Variance-Aware UCB
In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for the Multi-Armed Bandit (MAB) problems, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates for UCB-V, extending recent results for the canonical UCB in [21] and [23]. In an interesting contrast to the canonical UCB, our analysis reveals that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Besides the asymptotic characterization, we also provide non-asymptotic bounds for the arm-pulling rates in the high probability regime, offering insights into the regret analysis. As an application of this high probability result, we establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicate and advanced variance-aware online decision-making algorithms. A matching regret lower bound is also established, demonstrating the optimality of our result.
Greedy Algorithm for Structured Bandits: ASharp Characterization of Asymptotic Success / Failure
We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy--any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. Our characterization extends to contextual bandits and interactive decision-making with arbitrary feedback. Examples demonstrating broad applicability and extensions to infinite reward structures are provided.
Balancing Performance and Costs in Best Arm Identification
We consider the problem of identifying the best arm in a multi-armed bandit model. Despite a wealth of literature in the traditional fixed budget and fixed confidence regimes of the best arm identification problem, it still remains a mystery to most practitioners as to how to choose an approach and corresponding budget or confidence parameter. We propose a new formalism to avoid this dilemma altogether by minimizing a risk functional which explicitly balances the performance of the recommended arm and the cost incurred by learning this arm. In this framework, a cost is incurred for each observation during the sampling phase, and upon recommending an arm, a performance penalty is incurred for identifying a suboptimal arm. The learner's goal is to minimize the sum of the penalty and cost. This new regime mirrors the priorities of many practitioners, e.g.
Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference
In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, EXP3-N-CS, specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.
Bandit Guided Submodular Curriculum for Adaptive Subset Selection
Traditional curriculum learning proceeds from easy to hard samples, yet defining a reliable notion of difficulty remains elusive. Prior work has used submodular functions to induce difficulty scores in curriculum learning. We reinterpret adaptive subset selection and formulate it as a multi-armed bandit problem, where each arm corresponds to a submodular function guiding sample selection. We introduce ONLINESUBMOD, a novel online greedy policy that optimizes a utility-driven reward and provably achieves no-regret performance under various sampling regimes. Empirically, ONLINESUBMOD outperforms both traditional curriculum learning and bi-level optimization approaches across vision and language datasets, showing superior accuracy-efficiency tradeoffs. More broadly, we show that validationdriven reward metrics offer a principled way to guide the curriculum schedule. Our code is publicly available at GitHub 2.
Follow-the-Perturbed-Leader Nearly Achieves Best-of-Both-Worlds for the m-Set Semi-Bandit Problems
We consider a common case of the combinatorial semi-bandit problem, the m-set semi-bandit, where the learner exactly selects m arms from the total d arms. In the adversarial setting, the best regret bound, known to be O( nmd) for time horizon n, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires to explicitly compute the arm-selection probabilities via optimizing problems at each time step and sample according to them. This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the m arms that rank among the m smallest (estimated) loss with random perturbation. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the near optimal regret bound O( nm( p dlog(d) + m5/6)) in the adversarial setting and approaches best-of-both-world regret bounds, i.e., achieves a logarithmic regret for the stochastic setting. Moreover, our lower bounds show that the extra factors are unavoidable with our approach; any improvement would require a fundamentally different and more challenging method.
Optimal Best Arm Identification under Differential Privacy
Best Arm Identification (BAI) algorithms are deployed in data-sensitive applications, such as adaptive clinical trials or user studies. Driven by the privacy concerns of these applications, we study the problem of fixed-confidence BAI under global Differential Privacy (DP) for Bernoulli distributions. While numerous asymptotically optimal BAI algorithms exist in the non-private setting, a significant gap remains between the best lower and upper bounds in the global DP setting. This work reduces this gap to a small multiplicative constant, for any privacy budget ϵ. First, we provide a tighter lower bound on the expected sample complexity of any δ-correct and ϵ-global DP strategy.
Constrained Best Arm Identification
In real-world decision-making problems, one needs to pick among multiple policies the one that performs best while respecting economic constraints. This motivates the problem of constrained best-arm identification for bandit problems where every arm is a joint distribution of reward and cost. We investigate the general case where reward and cost are dependent. The goal is to accurately identify the arm with the highest mean reward among all arms whose mean cost is below a given threshold. We prove information-theoretic lower bounds on the sample complexity for three models: Gaussian with fixed covariance, Gaussian with unknown covariance, and non-parametric distributions of rectangular support. We propose a combination of a sampling and a stopping rule that correctly identifies the constrained best arm and matches the optimal sample complexities for each of the three models. Simulations demonstrate the performance of our algorithms.